{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Regression\n",
    "\n",
    "In this notebook, we present the metrics that can be used in regression.\n",
    "\n",
    "A set of metrics are dedicated to regression. Indeed, classification metrics\n",
    "cannot be used to evaluate the generalization performance of regression models\n",
    "because there is a fundamental difference between their target type `target`:\n",
    "it is a continuous variable in regression, while a discrete variable in\n",
    "classification.\n",
    "\n",
    "We use the Ames housing dataset. The goal is to predict the price of houses in\n",
    "the city of Ames, Iowa. As with classification, we only use a single\n",
    "train-test split to focus solely on the regression metrics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "ames_housing = pd.read_csv(\"../datasets/house_prices.csv\")\n",
    "data = ames_housing.drop(columns=\"SalePrice\")\n",
    "target = ames_housing[\"SalePrice\"]\n",
    "data = data.select_dtypes(np.number)\n",
    "target /= 1000"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
    "Appendix - Datasets description section at the end of this MOOC.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's start by splitting our dataset intro a train and test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "data_train, data_test, target_train, target_test = train_test_split(\n",
    "    data, target, shuffle=True, random_state=0\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some machine learning models are designed to be solved as an optimization\n",
    "problem: minimizing an error (also known as the loss function) using a\n",
    "training set. A basic loss function used in regression is the mean squared\n",
    "error (MSE). Thus, this metric is sometimes used to evaluate the model since\n",
    "it is optimized by said model.\n",
    "\n",
    "We give an example using a linear regression model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.metrics import mean_squared_error\n",
    "\n",
    "regressor = LinearRegression()\n",
    "regressor.fit(data_train, target_train)\n",
    "target_predicted = regressor.predict(data_train)\n",
    "\n",
    "print(\n",
    "    \"Mean squared error on the training set: \"\n",
    "    f\"{mean_squared_error(target_train, target_predicted):.3f}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our linear regression model is minimizing the mean squared error on the\n",
    "training set. It means that there is no other set of coefficients which\n",
    "decreases the error.\n",
    "\n",
    "Then, we can compute the mean squared error on the test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target_predicted = regressor.predict(data_test)\n",
    "\n",
    "print(\n",
    "    \"Mean squared error on the testing set: \"\n",
    "    f\"{mean_squared_error(target_test, target_predicted):.3f}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The raw MSE can be difficult to interpret. One way is to rescale the MSE by\n",
    "the variance of the target. This score is known as the $R^2$ also called the\n",
    "[coefficient of\n",
    "determination](https://scikit-learn.org/stable/modules/model_evaluation.html#r2-score-the-coefficient-of-determination).\n",
    "Indeed, this is the default score used in scikit-learn by calling the method\n",
    "`score`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "regressor.score(data_test, target_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The $R^2$ score represents the proportion of variance of the target that is\n",
    "explained by the independent variables in the model. The best score possible\n",
    "is 1 but there is no lower bound. However, a model that predicts the [expected\n",
    "value](https://en.wikipedia.org/wiki/Expected_value) of the target would get a\n",
    "score of 0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.dummy import DummyRegressor\n",
    "\n",
    "dummy_regressor = DummyRegressor(strategy=\"mean\")\n",
    "dummy_regressor.fit(data_train, target_train)\n",
    "print(\n",
    "    \"R2 score for a regressor predicting the mean:\"\n",
    "    f\"{dummy_regressor.score(data_test, target_test):.3f}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The $R^2$ score gives insight into the quality of the model's fit. However,\n",
    "this score cannot be compared from one dataset to another and the value\n",
    "obtained does not have a meaningful interpretation relative the original unit\n",
    "of the target. If we wanted to get an interpretable score, we would be\n",
    "interested in the median or mean absolute error."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import mean_absolute_error\n",
    "\n",
    "target_predicted = regressor.predict(data_test)\n",
    "print(\n",
    "    \"Mean absolute error: \"\n",
    "    f\"{mean_absolute_error(target_test, target_predicted):.3f} k$\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By computing the mean absolute error, we can interpret that our model is\n",
    "predicting on average 22.6 k\\\\$ away from the true house price. A disadvantage\n",
    "of this metric is that the mean can be impacted by large error. For some\n",
    "applications, we might not want these large errors to have such a big\n",
    "influence on our metric. In this case we can use the median absolute error."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import median_absolute_error\n",
    "\n",
    "print(\n",
    "    \"Median absolute error: \"\n",
    "    f\"{median_absolute_error(target_test, target_predicted):.3f} k$\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The mean absolute error (or median absolute error) still have a known\n",
    "limitation: committing an error of 50 k\\\\$ for a house valued at 50 k\\\\$ has the\n",
    "same impact than committing an error of 50 k\\\\$ for a house valued at 500 k\\\\$.\n",
    "Indeed, the mean absolute error is not relative.\n",
    "\n",
    "The mean absolute percentage error introduce this relative scaling."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import mean_absolute_percentage_error\n",
    "\n",
    "print(\n",
    "    \"Mean absolute percentage error: \"\n",
    "    f\"{mean_absolute_percentage_error(target_test, target_predicted) * 100:.3f} %\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In addition to using metrics, we can visualize the results by plotting the\n",
    "predicted values versus the true values.\n",
    "\n",
    "In an ideal scenario where all variations in the target could be perfectly\n",
    "explained by the obseved features (i.e. without any unobserved factors of\n",
    "variations), and we have chosen an optimal model, we would expect all\n",
    "predictions to fall along the diagonal line of the first plot below.\n",
    "\n",
    "In the real life, this is almost never the case: some unknown fraction of the\n",
    "variations in the target cannot be explained by variations in data: they stem\n",
    "from external factors not represented by the observed features.\n",
    "\n",
    "Therefore, the best we can hope for is that our model's predictions form a\n",
    "cloud of points symmetrically distributed around the diagonal line, ideally\n",
    "close enough to it for the model to be useful.\n",
    "\n",
    "To gain more insight, it can be helpful to plot the residuals, which represent\n",
    "the difference between the actual and predicted values, against the predicted\n",
    "values. This is shown in the second plot.\n",
    "\n",
    "Residual plots make it easier to assess if the residuals exhibit a variance\n",
    "independent of the target values or if there is any systematic bias of the\n",
    "model associated with the lowest or highest predicted values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "from sklearn.metrics import PredictionErrorDisplay\n",
    "\n",
    "fig, axs = plt.subplots(ncols=2, figsize=(13, 5))\n",
    "\n",
    "PredictionErrorDisplay.from_predictions(\n",
    "    y_true=target_test,\n",
    "    y_pred=target_predicted,\n",
    "    kind=\"actual_vs_predicted\",\n",
    "    scatter_kwargs={\"alpha\": 0.5},\n",
    "    ax=axs[0],\n",
    ")\n",
    "axs[0].axis(\"square\")\n",
    "axs[0].set_xlabel(\"Predicted values (k$)\")\n",
    "axs[0].set_ylabel(\"True values (k$)\")\n",
    "\n",
    "PredictionErrorDisplay.from_predictions(\n",
    "    y_true=target_test,\n",
    "    y_pred=target_predicted,\n",
    "    kind=\"residual_vs_predicted\",\n",
    "    scatter_kwargs={\"alpha\": 0.5},\n",
    "    ax=axs[1],\n",
    ")\n",
    "axs[1].axis(\"square\")\n",
    "axs[1].set_xlabel(\"Predicted values (k$)\")\n",
    "axs[1].set_ylabel(\"Residual values (k$)\")\n",
    "\n",
    "_ = fig.suptitle(\n",
    "    \"Regression using a model\\nwithout target transformation\", y=1.1\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "On these plots, we see that our model tends to under-estimate the price of the\n",
    "house both for the lowest and large True price values. This means that the\n",
    "residuals still hold some **structure typically visible as the \"banana\" or\n",
    "\"smile\" shape of the residual plot**. This is often a clue that our model\n",
    "could be improved, either by transforming the features, the target or\n",
    "sometimes changing the model type or its parameters. In this case let's try to\n",
    "see if the model would benefit from a target transformation that monotonically\n",
    "reshapes the target variable to follow a normal distribution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import QuantileTransformer\n",
    "from sklearn.compose import TransformedTargetRegressor\n",
    "\n",
    "transformer = QuantileTransformer(\n",
    "    n_quantiles=900, output_distribution=\"normal\"\n",
    ")\n",
    "model_transformed_target = TransformedTargetRegressor(\n",
    "    regressor=regressor, transformer=transformer\n",
    ")\n",
    "model_transformed_target.fit(data_train, target_train)\n",
    "target_predicted = model_transformed_target.predict(data_test)\n",
    "\n",
    "fig, axs = plt.subplots(ncols=2, figsize=(13, 5))\n",
    "\n",
    "PredictionErrorDisplay.from_predictions(\n",
    "    y_true=target_test,\n",
    "    y_pred=target_predicted,\n",
    "    kind=\"actual_vs_predicted\",\n",
    "    scatter_kwargs={\"alpha\": 0.5},\n",
    "    ax=axs[0],\n",
    ")\n",
    "axs[0].axis(\"square\")\n",
    "axs[0].set_xlabel(\"Predicted values (k$)\")\n",
    "axs[0].set_ylabel(\"True values (k$)\")\n",
    "\n",
    "PredictionErrorDisplay.from_predictions(\n",
    "    y_true=target_test,\n",
    "    y_pred=target_predicted,\n",
    "    kind=\"residual_vs_predicted\",\n",
    "    scatter_kwargs={\"alpha\": 0.5},\n",
    "    ax=axs[1],\n",
    ")\n",
    "axs[1].axis(\"square\")\n",
    "axs[1].set_xlabel(\"Predicted values (k$)\")\n",
    "axs[1].set_ylabel(\"Residual values (k$)\")\n",
    "\n",
    "_ = fig.suptitle(\n",
    "    \"Regression using a model that\\ntransforms the target before fitting\",\n",
    "    y=1.1,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The model with the transformed target seems to exhibit fewer structure in its\n",
    "residuals: over-estimation and under-estimation errors seems to be more\n",
    "balanced.\n",
    "\n",
    "We can confirm this by computing the previously mentioned metrics and observe\n",
    "that they all improved w.r.t. the linear regression model without the target\n",
    "transformation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\n",
    "    \"Mean absolute error: \"\n",
    "    f\"{mean_absolute_error(target_test, target_predicted):.3f} k$\"\n",
    ")\n",
    "print(\n",
    "    \"Median absolute error: \"\n",
    "    f\"{median_absolute_error(target_test, target_predicted):.3f} k$\"\n",
    ")\n",
    "print(\n",
    "    \"Mean absolute percentage error: \"\n",
    "    f\"{mean_absolute_percentage_error(target_test, target_predicted):.2%}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While a common practice, performing such a target transformation for linear\n",
    "regression is often disapproved by statisticians. It is mathematically more\n",
    "justified to instead adapt the loss function of the regression model itself,\n",
    "for instance by fitting a\n",
    "[`PoissonRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.PoissonRegressor.html)\n",
    "or a\n",
    "[`TweedieRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.TweedieRegressor.html)\n",
    "model instead of `LinearRegression`. In particular those models indeed use an\n",
    "internal \"log link\" function that makes them more suited for this kind of\n",
    "positive-only target data distributions, but this analysis is beyond the scope\n",
    "of this MOOC.\n",
    "\n",
    "The interested readers are encouraged to learn more about those models, in\n",
    "particular by reading their respective docstrings and the linked sections\n",
    "in the scikit-learn user guide reachable from the links above."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}